Proteins are the molecules encoded by genes. The proteins m
give rise to structure and, by virtue of their selective binding
to other molecules, make genes and all the other machinery of



by Russell F. Doolittle 
If DNA is the blueprint of life, then
proteins are the bricks and mortar. 
Indeed, they serve also as the jigs 
and tools needed in the assembly of 
a cell or an organism, and they even 
play the role of the builders who car- 
ry out the work of assembly. Your 
genes supply the information, but you 
are your proteins. 

Like DNA, a protein is a linear
polymer: a chain of subunits linked in
a continuous sequence. In other re- 
spects, however, the two kinds of 
molecule are quite different. Roughly 
speaking, all DNA molecules are alike
in overall structure, and they all have 
the same function (that of a genetic 
archive). Proteins, in contrast, fold up 
into a remarkable diversity of three- 
dimensional forms, which give them 
a corresponding variety of functions. 
They serve as structural components, 
as messengers and the receptors of 
messengers, as markers of individual 
identity and as weapons that attack 
cells bearing foreign markers. Some 
proteins bind to DNA and thereby regulate
the expression of genes; others
take part in the replication, transcrip- 
tion and translation of genetic infor- 
mation. Perhaps the most important 
proteins are the enzymes, the catalysts 
that determine the pace and the course 
of all biochemistry. 

In the study of proteins a major aim 
has been to decipher their structure 
and so learn how they work. A com- 
plete structural analysis is a laborious 
undertaking, and up to now biochem- 
ists have gained a thorough under- 
standing of only a small fraction of the 
known proteins. Nevertheless, some 
general principles have emerged; sub- 
structures that are common to diverse 
proteins, and that probably have similar
functions in many of them, can
now be recognized. Of equal interest is 
the question of how the thousands of 
proteins in a typical organism have 
evolved and diversified. The presence
of shared substructures implies a complex
evolution. It is not simply a matter
of one protein's being modified
and thus giving rise to another; rather, 
fragments of genetic information must 
somehow be exchanged and then ex- 
pressed in many proteins. 



Through all the functional diversity 
of proteins there runs a common 
thread: for the most part, proteins 
work by selectively binding to mole- 
cules. In the case of a structural pro- 
tein the binding often links identical 
molecules, so that many copies of the 
same protein aggregate to form a larg- 
er-scale structure such as a fiber, a 
sheet or a tubule. Other proteins have 
an affinity for a molecule different 
from themselves. Antibodies, for ex- 
ample, bind to specific antigens; hemo- 
globin binds to oxygen in the lungs and 
then releases it in distant tissues; regu- 
lators of genetic expression bind to 
specific patterns of nucleotide bases in 
DNA. Receptor proteins embedded in 
the cell membrane recognize messen- 
ger molecules (such as hormones and 
neurotransmitters), which may them- 
selves be proteins that have a specific 
affinity for the receptors. Virtually all 
the activities of proteins can be under- 
stood in terms of such selective chem- 
ical binding. 

The binding of a protein to the molecule
it recognizes is not fixed or permanent.
It is governed by a dynamic equilibrium,
in which molecules are continually
being bound and released. At
any instant the percentage of bound 
molecules depends on the relative 
amounts of the two substances present 
and on the strength of the association 
between them. The binding strength 
depends in turn on how well the mol- 
ecules fit together geometrically and 
on specific local interactions, such as 
electrostatic attraction or repulsion 
between charged regions. 
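
The dependence of the bound fraction on concentration and on binding strength can be made concrete with a small calculation. The sketch below assumes the simplest possible case: one binding site per protein, the bound molecule (the ligand) present in large excess, and binding strength expressed as a dissociation constant Kd. The function name and the numbers are illustrative assumptions, not values from the article.

```python
def fraction_bound(ligand_conc, kd):
    """Fraction of protein sites occupied at equilibrium (simple 1:1 binding).

    Assumes the free ligand concentration is about equal to the total
    (ligand in excess). Both arguments use the same concentration units.
    """
    return ligand_conc / (kd + ligand_conc)


if __name__ == "__main__":
    kd = 1e-6  # hypothetical dissociation constant: 1 micromolar
    for conc in (1e-7, 1e-6, 1e-5):
        occupied = fraction_bound(conc, kd)
        print(f"[ligand] = {conc:.0e} M -> fraction bound = {occupied:.2f}")
```

A smaller Kd stands for a tighter fit: the same occupancy is then reached at a lower concentration. When the concentration equals Kd, half the sites are occupied at any instant.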

Enzymes, in this respect, are much 
like other proteins. An enzyme recog- 
nizes a specific molecule (called the 
substrate) and binds to it in dynamic 
equilibrium; what distinguishes an en- 
zyme is that it can bring about some 
chemical change in the bound sub- 
strate. The change generally entails 
the forming or breaking of a covalent 
chemical bond: the substrate may be 
split into two pieces, a chemical group 
may be added or the pattern of the 
bonds in the substrate may simply be 
rearranged. 

The mechanism of enzyme action 
can be viewed as having three stages. 
First the enzyme binds to the substrate, 
then the chemical reaction takes place 
and finally the altered substrate is re- 
leased. All three steps are reversible. If 
an enzyme binds to molecule X and
converts it into molecule Y, the same
enzyme can also bind to Y and change
it back into X. Indeed, there are many
possible reaction paths. A molecule of 
either X or Y could be bound but re- 
leased before any change took place, 
or a molecule of X could be converted 
into Y and then changed back into X 
before it was released, and so on. 

It should be emphasized that the enzyme
itself does not determine the direction
of the reaction. The proportion
of X and Y at equilibrium depends on
thermodynamic considerations; the favored
proportion is the one that minimizes
the quantity called free energy.
(Roughly speaking, the free energy of
a system is equal to its energy minus
the product of its temperature and
its entropy, or disorder.) The enzyme
merely hastens the attainment of equilibrium.
Nevertheless, an enzyme can
effectively control the course of a biochemical
process. In the absence of an
enzyme most biochemical reactions 
are extremely sluggish; the appropri- 
ate enzyme can speed them up by a 
factor of a million or more. Although 
the enzyme has no influence on wheth- 
er more X is converted into Y or vice 
versa, it determines whether or not the 
conversion takes place at all. 
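
The claim that thermodynamics, not the enzyme, fixes the proportion of X and Y can be illustrated with the standard relation between a free-energy difference and an equilibrium ratio, [Y]/[X] = exp(-dG/RT). The relation is textbook thermodynamics rather than something stated in the article, and the function names and the free-energy value below are assumptions chosen for illustration.

```python
import math

R = 8.314  # gas constant, J/(mol*K)


def equilibrium_ratio(delta_g, temperature=298.15):
    """Return [Y]/[X] at equilibrium for X <-> Y.

    delta_g is the standard free energy of Y minus that of X, in J/mol.
    """
    return math.exp(-delta_g / (R * temperature))


def fraction_y(delta_g, temperature=298.15):
    """Fraction of all molecules present as Y at equilibrium."""
    ratio = equilibrium_ratio(delta_g, temperature)
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    delta_g = -5000.0  # hypothetical: Y is 5 kJ/mol lower in free energy
    print(f"[Y]/[X] = {equilibrium_ratio(delta_g):.1f}")
    print(f"fraction as Y = {fraction_y(delta_g):.2f}")
```

Nothing in the calculation refers to the enzyme. A catalyst changes how quickly this ratio is reached, not the ratio itself.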

An enzyme speeds a reaction by
lowering an energy barrier. Even
when a reaction is thermodynamically
favorable (when the products have
a lower free energy than the reactants),
there may be an intermediate state
with a higher free energy. The enzyme
tends to smooth this hump in the reac- 
tion path. The mechanism varies from 
case to case. Some enzymes merely 
provide an environment different from 
that of the aqueous medium, or they 
bring the reactants into close contact. 
Other enzymes take a more active role 
by adding or subtracting a proton, by 
straining bonds in the substrate mole- 
cule or even by forming transient co- 
valent bonds between the substrate 
and some part of the enzyme itself. 
Certain enzymes are helped by the ac- 
cessory molecules called coenzymes. 
The coenzyme binds to a specific site 
on the protein and provides chemical 
functions that are not available in the 
enzyme itself. 
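
How large a barrier must be smoothed to speed a reaction a millionfold? A rough estimate follows if one assumes that the rate depends exponentially on the barrier height, as in transition-state theory, and that nothing else about the reaction changes. That assumption and the names below are not from the article; the sketch is only meant to give a sense of scale.

```python
import math

R = 8.314  # gas constant, J/(mol*K)


def rate_enhancement(barrier_drop, temperature=298.15):
    """Factor by which the rate rises when the barrier falls by barrier_drop (J/mol)."""
    return math.exp(barrier_drop / (R * temperature))


def barrier_drop_for(factor, temperature=298.15):
    """Barrier reduction (J/mol) needed to speed a reaction by the given factor."""
    return R * temperature * math.log(factor)


if __name__ == "__main__":
    drop = barrier_drop_for(1e6)
    print(f"barrier reduction for a millionfold speedup: {drop / 1000:.1f} kJ/mol")
```

Under these assumptions a millionfold acceleration at room temperature corresponds to lowering the barrier by roughly 34 kilojoules per mole, which is comparable to the energy of a few hydrogen bonds.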

More than 2,000 enzymes have
been identified on the basis of the 
chemical reactions they catalyze. All 
these proteins must be structurally 
distinct; in other words, proteins must 
come in at least 2,000 forms capable of 
recognizing specific molecules. How 
are these diverse structures generated? 
The "alphabet" from which proteins
are made consists of the 20 amino acids
that can be specified in the genetic
code; every protein is a sequence of
amino acids drawn from this alphabet. 
The physical and chemical properties 
of a protein molecule depend on how 
the chain of amino acids folds up in 
three-dimensional space. 

All the information needed to define 
the three-dimensional structure of a 
protein is inherent in the amino acid 
sequence. As the chain is constructed 
on the ribosome, it folds up in the way
that minimizes the free energy; in other
words, the chain assumes its "most 
comfortable" configuration. In princi- 
ple, if one knew all the forces acting on 
the thousands of atoms in the protein 
and on the surrounding solvent mole- 
cules, one could predict the three-di- 
mensional structure from knowledge 
of the sequence alone. Such a calcula- 
tion is not now feasible. 

TWENTY AMINO ACIDS specified in the genetic code are the
basic components of all proteins. Here the amino acids are shown
joined head to tail to form a ring (which is not the structure of any
real protein); their three-letter and one-letter abbreviations are indicated.
The arrangement places amino acids that have similar
chemical properties near one another in the ring. An approximate
classification in five groups is based on the size of the amino acid's
side chain and on the degree to which it is polarized. (A polar molecule
has separated regions of positive and negative electric charge.)
These factors have a major influence on the folding of a protein.
In the evolution of a protein a mutant form is more likely to be
accepted if an amino acid is replaced by one that has similar properties,
that is, by one found nearby in the ring. The ring is similar to one
proposed by Rosemarie M. Swanson of Texas A&M University.

The 20 amino acids are all built on a
common foundation. They have an
amino group (NH2) at one end and a
carboxylic acid group (COOH) at the 
other end; both groups are attached to 
a central carbon atom called the alpha 
carbon. Also attached to the alpha car- 
bon are a hydrogen atom and a fourth 
group called the side chain. It is only in 
the nature of the side chain that the 
amino acids differ from one another. 

The backbone of the protein is built 
by linking amino acids head to tail: the
amino group of one unit is joined to 
the carboxyl group of the next. The 
fusion is accomplished by removing a 
molecule of water, leaving the struc- 
ture -CO-NH-. The carbon-nitrogen 
linkage created in this way is called a 
peptide bond, and the protein chain is 
referred to as a polypeptide. 
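
Because each peptide bond is formed with the loss of one water molecule, the mass of a chain is a little less than the sum of its free amino acids. The sketch below works through that bookkeeping. The molar masses are approximate values supplied for illustration, and the function name is an assumption.

```python
WATER = 18.02  # approximate molar mass of water, g/mol

# Approximate molar masses of a few free amino acids, g/mol.
FREE_AMINO_ACID_MASS = {
    "Gly": 75.07,
    "Ala": 89.09,
    "Ser": 105.09,
}


def polypeptide_mass(residues):
    """Mass of a linear chain built from the named amino acids.

    A chain of n units has n - 1 peptide bonds, and one water molecule
    is removed for each bond formed.
    """
    if not residues:
        return 0.0
    total = sum(FREE_AMINO_ACID_MASS[name] for name in residues)
    return total - (len(residues) - 1) * WATER


if __name__ == "__main__":
    print(f"Gly-Ala: {polypeptide_mass(['Gly', 'Ala']):.2f} g/mol")
    print(f"Gly-Ala-Ser: {polypeptide_mass(['Gly', 'Ala', 'Ser']):.2f} g/mol")
```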

The properties of the peptide bond 
impose certain constraints on the fold- 
ing of the protein. Electrons are shared 
among the oxygen, carbon and nitro- 
gen atoms in a way that gives the bond 
torsional stiffness; it resists rotation 
about its axis. As a result each peptide- 
bond unit lies in a plane, and the chain 
must fold almost entirely through ro- 
tations of the alpha-carbon bonds. The 
polypeptide backbone is not so much 
a flexible string of beads as it is an 
articulated chain of flat plates. 

The main influence on protein folding
comes from the properties of
the side chains. Interactions of one side 
chain with another and with molecules 
in the medium can force the polypep- 
tide to fold up into a compact globule 
with a specific, stable shape. 

Some of the amino acids are polar 
molecules: although they are electri- 
cally neutral overall, they have local- 
ized concentrations of positive and 
negative charge. The polarization re- 
sults from the presence of oxygen or 
nitrogen atoms, which have a strong 
affinity for electrons. A few of the ami- 
no acids not only are polar but also 
carry a net electric charge; in other 
words, they are ionized under physiological
conditions. Other side chains
(generally those made up exclusively 
of carbon and hydrogen) are nonpolar. 
There is a strong tendency for the po- 
lar side chains to seek a polar environ- 
ment and for the nonpolar ones to be 
segregated in nonpolar areas. Water, 
the medium in which most proteins 
are immersed, is a strongly polar sub- 
stance. When a polar or charged side 
chain projects into the aqueous envi- 
ronment, the water molecules assume 
an orderly arrangement. A nonpolar 
side chain in water disrupts this align- 
ment of charges. 

ALPHA HELIX AND BETA SHEET are common structural units of protein molecules.
The sequence of amino acids in a protein is called its primary structure; as the chain is synthesized,
regions of it fold spontaneously into alpha helixes and beta sheets, which constitute
the secondary structure; the helixes and sheets are assembled in turn to create the tertiary
structure. Both the alpha helix and the beta sheet are stabilized by hydrogen bonds (broken
colored lines), in which a hydrogen serves as a bridge between oxygen or nitrogen atoms.

The chief consequence of these interactions
is that a protein chain tends
to fold so that polar side chains are on
the exposed surface and nonpolar ones
are inside. An exception to this rule is
found in proteins embedded in cell
membranes. The membrane is made
up of fatty, nonpolar molecules, and
the segment of the protein that passes
through it likewise consists mainly of
nonpolar amino acids. They anchor
the protein in the membrane.

The electrostatic attraction between 
a polar side chain and water is a form 
of hydrogen bonding, in which a hy- 
drogen atom acts as a bridge between 
charged oxygen or nitrogen atoms. 
Hydrogen bonding between one atom 
and another within the protein itself 
also helps to stabilize the structure.

Hydrogen bonds are weaker than 
the covalent bonds of the polypeptide 
backbone. Moreover, the atoms in a 
protein that are hydrogen-bonded to 
one another could as easily be hydro- 
gen-bonded to water; the energy differ- 
ence between the two configurations is 
small. Because many hydrogen bonds 
can form simultaneously as the protein 
folds, however, they contribute greatly 
to the stability of the structure. 

Still another form of bonding can 
cross-link regions of the molecule. The 
amino acid cysteine has a sulfhydryl 
(SH) group at the end of its side chain. 
If the protein includes two cysteine 
units, they can combine to form a co- 
valent disulfide bond (-S-S-). Such 
cross-links are much stronger than hy- 
drogen bonds. 

The amino acid sequence of a protein
is called its primary structure.
The complete three-dimensional
conformation of a single polypeptide
strand is referred to as the tertiary
structure. As these terms suggest, there
is an intermediate level of organization
called the secondary structure. It
describes the local folding of the chain
in terms of structural units that appear
in almost all proteins.

Some 35 years ago Linus Pauling
showed that the protein backbone can
be coiled into a tight helix stabilized by
numerous hydrogen bonds; he called
the structure the alpha helix. The helix
makes one turn for every 3.6 amino
acids, and hydrogen bonds form between
amino acids four units apart.
The bonds do not involve the side
chains but rather extend from the NH
group of one peptide unit to the CO
group of another; for this reason the
stability of the helix is not strongly dependent
on the identity of the side
chains, and many different sequences
of amino acids can spontaneously assume
the form of an alpha helix.

At about the same time, Pauling
proposed a second stable configuration
he designated the beta sheet. In
this case lengths of polypeptide chain
lie next to one another and run either
parallel or antiparallel, with hydrogen
bonds connecting the adjacent strands.
Again the bonds join the NH and CO
groups of the backbone.

SECONDARY STRUCTURE of alcohol dehydrogenase consists of numerous alpha helixes
and beta sheets connected by short lengths of "random" structure. The NAD-binding
domain is in green and yellow; the catalytic domain, which binds to an alcohol molecule, is in
blue. A bound NAD molecule is shown in purple near the junction of the protein domains.

Some proteins are composed mostly 
of alpha helix and others are predomi- 
nantly beta sheet. In a typical globular 
protein the interior is a bundle of beta 
strands running back and forth dia- 
metrically and the surface is covered 
with alpha helixes. The exterior helixes
generally show a characteristic period- 
icity in amino acid sequence. Nonpo- 
lar side chains appear at every third 
or fourth position and are directed 
toward the interior of the molecule; 
the rest of the side chains, which are 
exposed to the aqueous environment, 
tend to be polar. 
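
The periodicity follows from the geometry of the helix. With 3.6 amino acids per turn, each successive side chain is rotated 100 degrees around the helix axis from the one before it, so side chains three or four positions apart point in nearly the same direction. The sketch below lays the positions out on such a "helical wheel"; the function names and the 70-degree window that defines one face of the helix are arbitrary choices made for illustration.

```python
# 360 degrees per turn / 3.6 amino acids per turn = 100 degrees per amino acid
DEGREES_PER_RESIDUE = 100.0


def wheel_angle(position):
    """Angle of a side chain around the helix axis, in degrees (0 to 360)."""
    return (position * DEGREES_PER_RESIDUE) % 360.0


def same_face(positions, half_width=70.0, reference=0.0):
    """Return the positions whose side chains point within half_width
    degrees of the reference direction."""
    chosen = []
    for position in positions:
        # Signed angular difference folded into the range -180..180.
        difference = (wheel_angle(position) - reference + 180.0) % 360.0 - 180.0
        if abs(difference) <= half_width:
            chosen.append(position)
    return chosen


if __name__ == "__main__":
    print(same_face(range(12)))
```

For the first 12 positions the side chains on the chosen face are those at positions 0, 3, 4, 7 and 11, a spacing of three or four units. A helix lying on the surface of a protein can therefore turn a nonpolar face toward the interior if nonpolar amino acids recur at about that interval.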

In recent years still another interme- 
diate level of protein structure has 
been perceived. For example, a struc- 
tural element present in numerous pro- 
teins consists of two beta strands con- 
nected by a segment of alpha helix. 
The three pieces nestle together com- 
fortably when they are arranged at 
particular angles. A structural feature 
of this kind, which typically encom- 
passes from 30 to 150 amino acids, is 
called a domain. It can be considered a 
single unit because its conformation is 
determined almost entirely by its own 
amino acid sequence. The beta-alpha- 
beta domain is of particular impor- 
tance because when two such domains 
lie next to each other, the crevice they 
form often serves as a binding site. 

A typical globular protein includes 
about 350 amino acids, which could 
fold in innumerable ways. The hierar- 
chy of larger-scale structures brings a 
measure of order. Local interactions 
between nearby amino acids give rise 
to alpha helixes, beta sheets or other 
forms of secondary structure. These 
subassemblies, acting as more or less 
coherent units, organize themselves 
into domains. The geometric arrange- 
ment of the domains constitutes the 
tertiary structure. The presence of 
the same secondary structures and do- 
mains in many dissimilar proteins ar- 
gues that they are not mere artificial 
abstractions introduced by the bio- 
chemist; on the contrary, they seem to 
be fundamental units in the evolution 
and diversification of proteins. 

Many proteins have a level of orga- 
nization beyond the tertiary structure. 
They are composed of multiple poly- 
peptide strands held together by a va- 
riety of weak bonds and sometimes 
further cemented by disulfide linkages. 
Some proteins also have nonpeptide 
components. Metal ions, for example, 
are essential to the activity of certain 
enzymes, and a structure called the
porphyrin ring is found in hemoglobin,
chlorophyll and a number of other 
proteins. Many proteins are also "dec- 
orated" on their surface with chains of 
sugar molecules. These additional fea- 
tures of protein structure are elabora- 
tions of the molecule added after the 
polypeptides are synthesized. 

In a way it is remarkable that any 
protein consistently assumes a single, 
well-defined conformation. The folded 
state does have a lower free energy 
than any alternative configuration, but 
the difference is small. In an alpha he- 
lix hydrogen bonding between peptide 
units reduces the energy, but if the 
helix were unraveled, the same sites 
would form hydrogen bonds with wa- 
ter. Furthermore, because a helix is an 
ordered structure, it has a low entropy, 
which tends to increase the free ener- 
gy. It is worth noting that not all poly- 
peptides have a stable folded pat- 
tern. Artificially constructed random 
sequences of amino acids are general- 
ly loose, flexible coils that continually 
shift from one structure to another. 
The proteins found in biological sys- 
tems appear to be a subset of poly- 
peptides selected for their stability of 
structure. 

How is the structure of proteins dis- 
covered? Of all the methods employed 
by the protein chemist, the most revealing
has been X-ray crystallography.
The basic idea is to form a diffraction
pattern by passing X rays
through a crystallized specimen of the 
protein. Because of the periodic struc- 
ture of the crystal, the pattern is essen- 
tially the same as the one that would be 
generated if a single molecule could 
be examined. From the diffraction pat- 
tern one constructs a map showing 
the density of electrons in the protein, 
and from the map the path of the back- 
bone and the positions of the side 
chains can be inferred. 

X-ray crystallography provides a 
three-dimensional view and shows
a protein in atomic detail. It is through 
such studies that the main themes of 
protein structure have been elucidat- 
ed: that the interior is filled with non- 
polar side chains, that the alpha helix 
and beta sheet are more than hypo- 
thetical structures, that most proteins 
are compact globules with dimpled 
surfaces, and much more. Crystallog- 
raphers also confirmed the existence 
of domains and discovered common 
patterns among them. 

Ideally the three-dimensional struc- 
ture of all proteins would be studied by 
X-ray crystallography, but that is not 
feasible. Crystallizing a protein in the
first place often calls for a good deal of
chemical wizardry, and the subsequent
analysis of diffraction patterns is arduous.
It took 23 years to map the structure
of hemoglobin. Up to now the
three-dimensional structures of only
about 100 proteins have been solved.

EVOLUTION OF PROTEINS can be traced by comparing amino acid [remainder of caption lost].
Axis label: EVOLUTIONARY DISTANCE (AMINO ACID REPLACEMENTS).

There was a time, in the 1950's, 
when merely determining the amino 
acid sequence of a protein was also a 
difficult and laborious procedure, even 
for a very small protein. First the to- 
tal composition of the protein was 
found by breaking all the peptide 
bonds in a sample of the material and 
measuring the amount of each amino 
acid present. Other samples were only 
partially digested, leaving small frag- 
ments whose amino acid content could 
then be analyzed in turn. A special 
chemical trick revealed which ami- 
no acid was at the amino end of each 
fragment. Having gathered informa- 
tion on many overlapping fragments, 
the biochemist could attempt to solve 
the elaborate puzzle of how the pieces 
fit together. 

In the 1960's the technology of sequence
analysis improved dramatically,
and by 1970 the procedure had
been automated. Amino acids were removed
one at a time from the amino
end of the chain and identified. A limit 
remained, however, on the maximum 
length of chain that could be handled, 
and so large proteins still had to be 
broken down into fragments. 

In the past few years an indirect 
method of sequence analysis has all 
but supplanted the traditional tech- 
niques of protein chemistry. The key 
to the new method is that the nucleo- 
tide sequence of a DNA molecule is 
much easier to determine than the ami- 
no acid sequence of a protein. If one 
has a length of DNA that is known to 
encode the structure of the protein, it is 
a simple matter to sequence the DNA 
and translate each three-base codon 
into the corresponding amino acid. 
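
The translation step is mechanical enough to write down in a few lines. The sketch below builds the standard genetic code as a lookup table and reads a DNA sequence three bases at a time. It assumes that the sequence is the coding strand, that the first base begins a codon and that reading should stop at the first stop codon; the function name and the example sequence are illustrative.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G.
# One-letter amino acid abbreviations; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(bases): amino_acid
    for bases, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon
    or when fewer than three bases remain.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGCATTGGTAA"))  # Met-His-Trp, then a stop codon
```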

The one difficult operation is finding 
the DNA that encodes the protein. 
In one strategy the first step is to analyze
about 25 amino acids at the amino
end of the protein. An appropriate segment
of from five to seven amino acids
within this range is then "back-translated"
into a nucleotide sequence. This
process is not without ambiguity: although
every codon specifies exactly
one amino acid, most of the amino 
acids can be specified by more than 
one codon. The key is to choose a se- 
quence that has as little ambiguity as 
possible and then to produce a DNA 
molecule for each possible back trans- 
lation. If the amino acids include a his- 
tidine unit, for example, DNA is made 
with both CAT and CAC codons at 
the appropriate position; these are the 
two codons that specify histidine. 
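
The search for a segment with little ambiguity amounts to counting. Each amino acid contributes as many alternatives as it has codons, and the number of DNA molecules needed is the product of those counts. The sketch below inverts the standard genetic code, counts the back translations of a peptide and slides a window along it to find the least ambiguous stretch. The function names and the example peptide are hypothetical.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Invert the code: one-letter amino acid -> list of DNA codons.
CODONS_FOR = {}
for bases, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(amino_acid, []).append("".join(bases))


def count_probes(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    count = 1
    for amino_acid in peptide:
        count *= len(CODONS_FOR[amino_acid])
    return count


def back_translations(peptide):
    """List every DNA sequence that encodes the peptide."""
    choices = [CODONS_FOR[amino_acid] for amino_acid in peptide]
    return ["".join(codons) for codons in product(*choices)]


def least_ambiguous_window(peptide, length):
    """Return (start, segment, probe count) for the segment of the given
    length that needs the fewest probes; ties go to the earliest segment."""
    best = None
    for start in range(len(peptide) - length + 1):
        segment = peptide[start:start + length]
        count = count_probes(segment)
        if best is None or count < best[2]:
            best = (start, segment, count)
    return best


if __name__ == "__main__":
    print(CODONS_FOR["H"])                           # the two histidine codons
    print(least_ambiguous_window("LSRMWHKL", 3))     # hypothetical peptide
```

Methionine and tryptophan have a single codon each, whereas leucine, serine and arginine have six, so a segment rich in the former is a far better starting point than one rich in the latter.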

The back-translated DNA serves as 
a probe to find complementary DNA 
sequences. It is labeled with a radio- 
active isotope of phosphorus and al- 
lowed to hybridize with the DNA in a 
library of cloned gene fragments. The 
clones with a matching sequence are 
readily identified by the presence of 
the radioactive phosphorus; they are 
isolated, cultured in quantity and then 
sequenced. The approach may seem 
roundabout, but it is simpler and more 
accurate than direct chemical analy- 
sis of protein fragments. Some 4,000 
polypeptide sequences are now known. 

The geneticist Theodosius Dobzhan- 
sky once wrote: "Nothing in biology 
makes sense except in the light of evo- 
lution." The same is true of protein 
structure: it makes sense only in terms 
of protein evolution. Just as all living 
organisms surely trace their lineage to 
a few progenitors, the great majority 
of proteins must be descended from a 
very small number of archetypes. 

Evidence supporting this assertion 
comes from many quarters, and I 
shall defend it only briefly. The most 
straightforward argument is the mani- 
fest difficulty of "inventing" a protein 
de novo. As pointed out above, most
random polypeptides do not even fold,
much less exhibit a biological function;
a new protein is far more likely to
arise from modification of an existing
one. There is abundant evidence of
this process in specific amino acid sequences
that are encoded by more than
one segment of DNA in a given genome.
Moreover, in proteins that fold
to form localized domains crystallographers
consistently find the same patterns
in varied settings; once a substructure
has proved useful, it seems to
be called on repeatedly.

The primary mechanism of protein
evolution is gene duplication, in which
a cell comes to include two copies (or
more) of a single gene. One copy retains
its original function, so that the
organism's viability is not compromised
by the lack of an essential protein.
The redundant copy is therefore
free to mutate without constraint from
natural selection. Most mutations generate
a nonfunctional protein, but an
occasional advantageous change can 
create either an improved version of 
the original protein or a protein with 
an entirely new function. 

There are two aspects to the study of 
protein evolution, which must be care- 
fully distinguished. One can exam- 
ine the "same" protein in various spe- 
cies, observing how the structure has 
changed over the course of biological 
time. For example, the amino acid sequence
of cytochrome c, a protein that
transfers electrons in metabolism, has 
been determined for more than 80 spe- 
cies, from bacteria to man. One prod- 
uct of such studies is a taxonomy of the 
organisms based on the relations of 
their proteins. The other approach is to 
compare the structures of various pro- 
teins within a single species. From this 
endeavor one can construct the family 
tree of the proteins themselves. 

Comparisons from species to spe- 
cies offer considerable insight into pro- 
tein chemistry. Between closely relat- 
ed organisms the commonest changes 
substitute one amino acid for another 
with similar properties, so that the 
overall structure of the molecule is 
not disrupted. As the evolutionary dis- 
tance between the species increases, 
the sequences diverge. Ultimately the 
consanguinity of the sequences may be 
undetectable, even though the two pro- 
teins are unmistakably alike in tertiary 
structure. What this means is that com- 
pletely different amino acid sequences 
can fold into the same shape. 



In comparing different proteins with- 
in a single species it soon becomes ob- 
vious there are broad families of re- 
lated molecules. The half-dozen poly- 
peptides that make up various forms 
of hemoglobin, for example, and the 
single polypeptide of myoglobin all 
share clear similarities. They are not 
only analogous (meaning they are sim- 
ilar in function) but also homologous 
(meaning they derive from a common 
ancestor). Among the enzymes it is not 
surprising that those catalyzing simi- 
lar reactions often have homologous 
sequences. Glutathione reductase and 
lipoamide reductase provide an illus- 
trative example. Both enzymes cata- 
lyze the transfer of hydrogen ions to 
sulfur-bearing compounds; they are 
identical at more than 40 percent of 
their amino acid positions. A similar 
degree of homology is evident between 
chymotrypsinogen and trypsinogen 
and between ornithine transcarbamylase
and aspartate transcarbamylase.

As the kinship between proteins 
grows more remote, sequence homology
becomes harder to detect.
Worse, the arithmetic of sequence 
comparison is such that unrelated se- 
quences may appear tantalizingly sim- 
ilar. Offhand, one might expect two 
randomly chosen polypeptides to be 
identical at about 5 percent of their 
amino acid positions; after all, there 
are 20 amino acids. If the comparison 
could be made by simply writing down 
the sequences one above the other and
then ticking off the matches, the 5 percent
limit would apply, but in reality a
more sophisticated method is needed.
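
The 5 percent figure is the chance that two independently drawn amino acids coincide when all 20 are equally likely. More generally the chance is the sum of the squared frequencies of the amino acids, which can only rise when some amino acids are commoner than others. The sketch below computes that expectation and also does the simple "ticking off" of matches for two sequences of equal length written one above the other. The function names and the skewed composition are illustrative assumptions.

```python
def expected_identity(frequencies):
    """Chance that two independently drawn amino acids are the same,
    given the frequency of each kind of amino acid."""
    return sum(f * f for f in frequencies)


def percent_identity(seq_a, seq_b):
    """Percentage of matching positions for two sequences of equal length,
    compared without any gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    uniform = [1.0 / 20.0] * 20
    print(f"uniform composition: {100 * expected_identity(uniform):.1f} percent")

    # Hypothetical skewed composition: one amino acid makes up half the chain.
    skewed = [0.5] + [0.5 / 19.0] * 19
    print(f"skewed composition: {100 * expected_identity(skewed):.1f} percent")
```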

A protein can be altered not only by 
the substitution of one amino acid for 
another but also by the deletion or in- 
sertion of amino acids. Suppose two 
proteins are identical except that one 
has lost its first amino acid; if no allow- 
ance were made for this deletion, the 
proteins would appear to be unrelated. 
On the other hand, if unlimited gaps 
and insertions were allowed, any two 
proteins could be forced to match arbi- 
trarily well. In practice the sequence 
comparison is done with a comput- 
er program that rewards matches be- 
tween identical or similar amino acids 
and imposes penalties for gaps and in- 
sertions. Even so, it is virtually impos- 
sible to distinguish between chance 
similarity and common ancestry when 
the number of identical positions falls 
below about 15 percent. 
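
A program of this kind can be sketched with the classic dynamic-programming method for global alignment (the Needleman-Wunsch algorithm). The version below is a minimal illustration, not the program used in the work described here: it rewards only identical amino acids, charges a fixed penalty for each gap position, and the scoring values are arbitrary assumptions. A practical program would also give partial credit for similar amino acids.

```python
def align(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a fixed penalty per gap position.

    Returns (score, aligned_a, aligned_b), with "-" marking a gap.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for placing seq_a[i-1] opposite seq_b[j-1].
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Hypothetical sequences: the second has lost its first amino acid.
    total, top, bottom = align("MKTAYIAKQR", "KTAYIAKQR")
    print(total)
    print(top)
    print(bottom)
```

In the example the two sequences share no matching positions when written one above the other without adjustment, yet a single gap at the start brings all nine remaining amino acids into register. Making the gap penalty smaller lets more gaps in and inflates the apparent similarity, which is why the penalty has to be chosen with care.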

In tracing the genealogy of proteins 
the relations of greatest interest are 
those between sequences that (after 
adjustment for gaps and insertions) are 
between 15 and 25 percent identical. 
This "twilight zone" is where one must 
look for the roots of the protein family 
tree, to find molecules that diverged 
early in the course of their evolution. 

MOLECULAR PALEONTOLOGY reveals a pattern of common
ancestry for five proteins from diverse species. Each protein is represented
by a sequence of the one-letter abbreviations for amino
acids given in the illustration on page 40; the colors relate amino
acids with similar properties. Dashes indicate gaps or insertions.
Cysteine units, which can form cross-links that stabilize the folded
structure, are marked by boxes. Ovalbumin is an abundant protein
in egg white; antithrombin III and alpha-1 antitrypsin are found in
blood plasma; barley protein Z was recently discovered in barley
seeds, and angiotensinogen is the precursor of a small protein that
regulates blood pressure. Both antithrombin III and alpha-1 antitrypsin
are known to act as inhibitors of proteases (enzymes that
cut protein chains). The functions of the other proteins had not
been known, but now it seems they too may be protease inhibitors.

COMMON SEQUENCE embedded within six disparate proteins
suggests they may have shared genetic information at some point
in their evolution. Only a segment of each protein is shown; it corresponds
to a single identifiable domain. The similarities within the
domain are unmistakable, even though some of the proteins differ
greatly elsewhere in their structures. It appears that DNA encoding
the domain has been copied from gene to gene. The proteins are
all recent products of evolution, found only in vertebrate animals.

In the early 1960's it became clear
that a repository of amino acid sequences
would facilitate studies of
protein evolution, and in 1965 Richard
Eck and Margaret O. Dayhoff issued
the first volume of the Atlas of Protein
Sequence and Structure. Their goal was
to publish annually "all the sequences 
that could fit between a single pair 
of covers." It soon became apparent, 
however, that the covers would have to 
be very far apart, and computer tapes 
began to replace bound volumes as the 
working medium of the sequence com- 
parer. Today any investigator can gain 
access to large sequence banks from a 
computer terminal. 

About 10 years ago, working with a 
tape of data from the Atlas, I began to 
study the phylogeny of certain pro- 
teins. I was soon maintaining my own 
data bank, and whenever a new se- 
quence was reported, I would enter it 
in the archive to see if it resembled 
anything already known. The number 
of matches was surprisingly large. 

I should like to give an example of 
how this molecular paleontology 
works. In the late 1970's Staffan Mag- 
nusson and his co-workers at the Uni- 
versity of Aarhus in Denmark deter- 
mined the amino acid sequence of anti- 
thrombin III, a protein in the blood 
plasma of vertebrate animals. Anti- 
thrombin III neutralizes thrombin, a 
blood-clotting factor whose mode of 
action is that of a protease, or protein- 
cutting enzyme. At about the same 
time a second group reported the se- 
quence of alpha-1 antitrypsin, another 
protease inhibitor in the blood plasma. 
The Danish group compared the two 
sequences and found they were identi- 
cal at 120 of 390 sites, a homology of 
about 30 percent. It seemed obvious 
they had descended from a common 
ancestral protein. 
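
The arithmetic behind a figure such as "120 of 390 sites" can be sketched in a few lines. The function below is an illustrative assumption, not the Danish group's procedure: it takes two sequences that have already been aligned (with dashes for gaps), skips any position where either sequence has a gap, and reports the percentage of the remaining positions that are identical. The name percent_identity and the toy fragments are hypothetical, and conventions for the denominator vary from one study to another.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of aligned, non-gap positions at which two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy aligned fragments, made up for illustration (not real protein data).
toy_a = "MKV-LAGTC"
toy_b = "MRVQLA-TC"
print(round(percent_identity(toy_a, toy_b), 1))

# The figure quoted in the text: 120 identical sites out of 390 compared.
print(round(100.0 * 120 / 390, 1))
```

The second print statement reproduces the "about 30 percent" quoted above: 120 divided by 390 is a little under 31 percent.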

Not long after, workers at the Na- 
tional Biomedical Research Founda- 
tion at Georgetown University entered 
into their computer the sequence of 
ovalbumin, a protein abundant in egg 
white. They found that it resembles an- 
tithrombin III and alpha-1 antitrypsin, 
again to the extent of about 30 percent. 
The discovery came as a surprise, be- 
cause up to then no one had any idea 
what the function of ovalbumin might 
be. The possibility that it is a protease 
inhibitor now had to be considered. 

In 1983 a Japanese group published 
the sequence of angiotensinogen, the 
precursor of a small peptide hormone 
that regulates blood pressure. Al- 
though the hormone itself is only 10 
amino acids long, the precursor ex- 
tends to about 400 units. When I com- 
pared the sequence of angiotensinogen 
with the sequences in my data bank, 
the search revealed a low-level resem- 
blance to alpha-1 antitrypsin. The re- 
semblance was one of those in the twi- 
light zone, amounting to only a 20 
percent identity, but a statistical anal- 
ysis convinced me the two proteins are 
members of the same family. Since 
then corroborating observations have 
been made by others, and there is no 
doubt of the kinship. 
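
The text does not say which statistical analysis was applied to the twilight-zone match. One common way to ask whether a weak resemblance exceeds chance is a shuffle test, sketched below purely as an assumption: score the real alignment, then score the first sequence against many randomly shuffled versions of the second (same composition, scrambled order) and express the real score in standard deviations above the shuffled mean. The scoring values (match +1, mismatch 0, gap -1), the function names and the toy sequences are all hypothetical; practical analyses use amino acid substitution matrices and more careful gap penalties.

```python
import random
import statistics


def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = 0, gap: int = -1) -> int:
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a: str, b: str, trials: int = 200, seed: int = 0) -> float:
    """Standard deviations by which the real score exceeds shuffled scores."""
    rng = random.Random(seed)
    real = alignment_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, random order
        shuffled_scores.append(alignment_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    spread = statistics.pstdev(shuffled_scores)
    if spread == 0:
        raise ValueError("shuffled scores do not vary; cannot form a z-score")
    return (real - mean) / spread


# Toy sequences, made up for illustration (not real protein data).
seq_a = "MKVLAGTCWYHDEFNQRSIPMKVLAGTCWY"
seq_b = "MRVLSGTCWFHDEYNQKSLPMKILAGSCWY"
print(shuffle_z_score(seq_a, seq_b))
```

A large positive value suggests the resemblance is unlikely to arise from amino acid composition alone; where to draw the threshold is a matter of judgment that this sketch does not settle.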

Another Danish group has recently 
added a fifth branch to this unexpected 
tree of related proteins: it is a sub- 
stance of unknown function found in 
barley seeds and called protein Z. Al- 
though protein Z is only half the size 
of the others (about 200 amino acids), 
it is clearly related to them. Indeed, 
the half size fits well with experimen- 
tal findings that the other proteins in 
the family have two major domains. 

The discovery of these five related 
proteins in diverse settings suggests 
two lessons. First, whether or not the 
4,000 amino acid sequences known to- 
day represent a significant fraction of 
all proteins, a point has been reached 
where any newly determined sequence 
has a good chance of resembling one 
already on record. Second, certain 
large-scale arrangements of amino ac- 
ids are so useful in biochemistry that 
they have been employed over and 
over again in different contexts. Often 
these functional units can be identified 
with the domains recognized in struc- 
tural studies. 

One of the most widely distributed 
domains was discovered in 1974 
by Michael G. Rossmann and his col- 
leagues at Purdue University. They 
noted from X-ray-diffraction maps 
that several enzymes had an important 
feature in common: even though the 
overall structures of the proteins were 
quite different, they all included a do- 
main of about 70 amino acids with es- 
sentially the same folding pattern. The 
enzymes also differed greatly in func- 
tion, but they had in common the abili- 
ty to bind certain coenzymes, name- 
ly nicotinamide adenine dinucleotide 
(NAD), flavin mononucleotide (FMN) 
or adenosine monophosphate (AMP). 
All these molecules include a mono- 
nucleotide within their structure. The 
ubiquitous domain in the enzymes is 
the binding site for the mononucleo- 
tides, and Rossmann named it the mon- 
onucleotide fold. 

The discovery led Rossmann to a 
bold hypothesis. The domain found in 
all these enzymes, he proposed, is the 
ghost of a primitive protein from pre- 
cellular times. Its ability to bind nucle- 
otides was so important that it was 
incorporated into the machinery of 
several of the prototype enzymes that 
emerged in the first living systems; it is 
still recognizable today. 

The model is an attractive one. It 
seems likely that the first functional 
proteins were small and that the im- 
portant capability was the binding of 
other molecules. If two small proteins 
able to bind two different small mole- 
cules were joined, the rudiments of ca- 
talysis could be initiated. Once start- 
ed, a succession of gene duplications 
could lead to an extended family of 
stable proteins. In the early stages the 
loosely gathered proteins would be 
clumsy and inefficient. Opportunities 
for improvement would be numer- 
ous, however, and the natural selec- 
tion of mutant structures would dom- 
inate events. Eventually the proteins 
would be so artfully suited to their 
function and the enzymes so efficient 
that "natural rejection" of mutant vari- 
ants would prevail. 

Not long after Rossmann suggest- 
ed that primitive proteins might 
have been created by the fusion of use- 
ful domains, recombinant-DNA stud- 
ies led to the startling discovery that 
eukaryotic genes are not continuous. 
They consist of segments that encode 
part of a protein's structure (exons) 
separated by long stretches of noncod- 
ing DNA (introns). In some cases the 
introns were observed to fall at or near 
the boundaries of a protein's domains. 
This correspondence led Walter Gil- 
bert of Harvard University to propose 
that exons are the genomic equivalent 
of the interchangeable protein parts 
hypothesized by Rossmann. In Gil- 
bert's view, not only were the first pro- 
teins created by the assembly of stable 
domains but also evolution had main- 
tained the genetic isolation of the do- 
mains over the course of several bil- 
lion years. It is easy to see how this 
genomic organization might convey 
an adaptive advantage: the continuing 
reassembly of domains in new combi- 
nations would give rise to novel, and 
occasionally useful, proteins. Gilbert's 
ideas have been widely accepted, 
although there are also counterargu- 
ments. In many eukaryotic proteins 
introns fall at places other than obvi- 
ous domain boundaries. Furthermore, 
prokaryotic genes have no introns at 
all; it is necessary to suppose they were 
eliminated in the interest of genomic 
economy. 

Lately another remarkable instance 
of the dispersal of domains throughout 
a group of proteins has come to light. 
In this case the proteins are all recent 
products of evolution; they are found 
only in vertebrate animals that arose 
well within the past billion years. 
Moreover, the distribution of the do- 
mains among these proteins cannot 
readily be explained as a simple result 
of descent from a common ancestor. 
One domain is present in 18 copies 
scattered throughout six proteins. It 
seems clear these subunits have been 
passed freely from one protein to an- 
other and have been inserted wherever 
their functional activity is needed. For 
several of the proteins it has been 
shown that the DNA coding for the 
domains is precisely delimited by in- 
trons. In these cases there can be no 
question that the organization of the 
genome into exons and introns has 
been instrumental in the rearrange- 
ment of the mosaic gene products. 

SHUFFLING OF MULTIPLE DOMAINS is evidence of the continual diffusion of ge- 
netic information in higher organisms. Five domains are represented by various geometric 
symbols, and their distribution is shown in six proteins. The domain whose sequence is given 
in the illustration on the opposite page is the one marked here by a cross. The distribution 
of the domains cannot readily be explained by assuming that all the domains in a given pro- 
tein were inherited from the same ancestral gene; instead they seem to have spread from 
one protein to another by chromosomal rearrangements. In several cases boundaries be- 
tween domains in the protein correspond to boundaries between exons and introns in the ge- 
nome, which may have facilitated the shuffling of gene segments in the course of evolution. 

Does this confirm Gilbert's hypothe- 
sis that exonic shuffling of domains has 
been a major feature of protein evolu- 
tion from the earliest times? Although 
such shuffling is certainly going on 
now, I think it is a mistake to assume 
the same mechanism was at work in 
more primitive organisms. Introns can 
be dealt with in eukaryotic genes only 
because sophisticated splicing machin- 
ery ensures that the pieces of messen- 
ger RNA are properly translated into 
protein. It seems unlikely the same ap- 
paratus could have been present in the 
earliest life forms. Exon exchange in 
the mosaic vertebrate proteins is more 
likely a reenactment of ancient events, 
but in a totally modern guise. 

Such variations on a theme are to be 
expected in a system as complex as the 
living cell, where a change in one mol- 
ecule can affect thousands of others, 
including the very machinery responsi- 
ble for synthesizing the first molecule. 
Just as proteins evolve, so do the 
mechanisms of protein evolution. 



